arXiv:1508.03601v1 [cs.SI] 14 Aug 2015 


Is Stack Overflow Overflowing With Questions and Tags 

Ranjitha R K and Sanjay Singh * 

August 17, 2015 


Abstract 

Programming question and answer (Q&A) websites, such as Quora, Stack Overflow, and Yahoo! 
Answers, help us understand programming concepts easily and quickly in a way that has 
been tested and applied by many software developers. Stack Overflow is one of the most frequently used 
programming Q&A websites, where the questions and answers posted are presently analyzed manually, 
which requires a huge amount of time and resources. To save this effort, we present a topic modeling based 
technique that analyzes the words of the original texts to discover the themes that run through them. We 
also propose a method to automate the process of reviewing the quality of questions in a Stack Overflow 
dataset in order to avoid ballooning Stack Overflow with insignificant questions. The proposed method 
also recommends appropriate tags for a new post, which averts the creation of unnecessary tags on 
Stack Overflow. 

1 Introduction 

Software development is a complex activity that involves many concerns, such as the use of APIs, the 
definition of interfaces, bug fixing, and architectural design. To work with these aspects, programmers 
consult different sources of information. Code examples ease the burden of solving problems on their own. 
Developers therefore rely on different sources of knowledge to find answers to their problems. One 
well-known programming question and answer website for finding solutions is Stack Overflow (SO). 

Stack Overflow is an online platform where developers post their programming questions, answer 
existing questions, and find solutions to difficulties they face while programming. A developer must 
add tags when posting a question to help other users find out what the question is about. If an answer 
solves the questioner's problem, the questioner can select it; it is then called the accepted answer 
to that question. Members of the site can vote on questions and answers. Positive votes are called 
upvotes and negative votes are called downvotes; together they show how helpful a question or answer 
was for the users. 

The score of a question or answer is the difference between its number of upvotes and downvotes. 
Each user's activities on Stack Overflow, such as posting questions or answers, voting on them, and 
posting comments, increase the user's reputation score. The greater the reputation, the more 
capabilities a member has on Stack Overflow, such as deleting questions or answers and closing questions. 

Each post contains metadata such as title, body, creation date, post type ID, view count, answer count, 
and comment count. The title gives a short description of the question being posted by the developer. 
The body contains the complete details of the question, which may include code snippets. As of 
early August 2010, Stack Overflow had a total of 300k registered users who asked 833k questions, and the 
site served 7.8 million monthly visitors [1]. According to the Stack Overflow data dump of September 2013 
provided by the Stack Exchange network, Stack Overflow stores more than 5.5M questions, 10.2M answers, 
and a community of more than 2.3M users [2]. 

Over time, these websites turn into knowledge repositories of software engineering. By 
analyzing and understanding such a repository, we obtain key insights into the use of specific 
technologies by developers and into the trends of developer discussions. This further helps us better 
understand the thoughts and needs of developers. The methodology we use to analyze such a knowledge 
repository is Latent Dirichlet Allocation (LDA) [3], a statistical topic modeling technique that 
automatically discovers the main topics present in developer discussions. 

Once the Stack Overflow dataset is extracted, we consider only the body of each post, as it contains 
most of the text content. We preprocess the text by discarding code snippets and removing 
all HTML tags and stop words. LDA is then applied to the preprocessed data. The result of LDA is the set 
of topics to which each post is related. We also analyze the trends of the topics and the interaction 
patterns between the topics. This is followed by identifying the topics related to new questions entered by 
developers and suggesting tags based on those topics. In addition, we analyze the quality of 
questions in our Stack Overflow dataset using the score value of each post. We also analyze the quality of 
questions related to each topic discovered by LDA. The main aim of identifying the quality of questions is 
to maintain the quality and growth of the Stack Overflow website. 

2 Related Work 

General Q&A Websites: Previous work has focused on analyzing general Q&A websites based on users' social 
interactions. Gyongyi et al. [4] have analyzed several aspects of user behavior in Yahoo! Answers, a Q&A 
website for the general public. The authors use the number of questions and answers in each predefined 
top-level category to determine the popularity of each category. Adamic et al. [5] have also analyzed 
Yahoo! Answers to cluster the top-level categories into three broader categories using both content and user 
interactions. In contrast to these efforts, instead of using existing tags, we use a statistical topic model, 
LDA, to automatically discover topics from the textual content of the posts and employ temporal measures 
to identify a topic's popularity over time. 

Stack Overflow: Previous work has analyzed Stack Overflow to categorize its questions [6] and identify 
its design features [1]. Treude et al. [6] have analyzed Stack Overflow to find topics and to categorize 
questions into distinct types, such as 'how-to' and 'discrepancy'. They apply their analysis to 15 days' worth 
of posts, using user-created tags to identify the topics. They manually code the questions based on a random 
sample of the data. In contrast, we apply our automated method, based on LDA, to 9 months' worth of posts 
and use tags as a secondary basis to deduce trends of different technologies under broader topic categories. 

Mamykina et al. [1] identify the core design features that led to the popularity of Stack Overflow: the 
reputation system based on points, the strong involvement of the design team with the community, and the 
single-domain focus. Further, the authors categorize users based on their frequency of activity on Stack 
Overflow, for example, community activists and low-profile users. Instead of user activity, we focus on the 
textual content generated by the users in order to extract the major topics of discussion. 

Other Social Platforms: Work has been reported on the analysis of other datasets and social platforms, 
such as code search engine usage logs and developer blogs, to find out the topics in which developers are 
interested. In particular, Bajracharya and Lopes [7] analyze the log of a popular code search engine to 
discover major code search topics. Similar to our technique, they apply LDA to the usage log of the code 
search engine. Some of the topics found by their analysis are aligned with our findings, for example, data 
structures, files, GUI, networking, parsing/compiling, security, and string. 

However, their analysis is based on a specific group of developers, namely Java programmers. In contrast, 
our analysis takes into account developers using a myriad of different programming languages and platforms. 
Further, we analyze both questions and answers, whereas the aforementioned work analyzes only the question 
content (i.e., search queries). Moreover, their study is focused on a specific need of developers: finding 
source code examples for a particular problem. In contrast, our analysis of a community based Q&A website 
addresses developer needs from a broader perspective. 

In another study, Pagano and Maalej [8] analyze the blogging activity of developers using topic models to 
find topics in those blogs. In particular, they analyze blogs written by developers who are active committers 
to a certain code base (e.g., Eclipse, PostgreSQL, GNOME, and Python). 

3 Methodology 

Developers provide tags when posting questions, and Stack Overflow currently uses those tags 
to categorize each question post. The drawback of these tags is that they are often erroneous 
and inconsistent, which leads to tag explosion. To overcome this problem we use the topic modeling technique 
LDA. Figure 1 shows the steps of our proposed method. First, we extract the posts from the Stack Overflow 
dataset. Second, we preprocess the extracted post data. Third, we apply the topic modeling technique LDA 
to the preprocessed data to overcome the problem of tag explosion. Finally, we analyze the output of LDA. 



Figure 1: An overview of proposed method 


3.1 Data Extraction 

We extract the posts.xml file from the Stack Overflow data dump, which contains all the user posts, i.e., 
questions and answers, from 31st July 2008 to 27th March 2009 on Stack Overflow. Each individual post is 
considered a separate document. The total number of documents is thus 513136, of which 
111871 are question posts and 401265 are answer posts. We consider both question and answer posts because 
most of the text content is present in answer posts and we need to discover the relationship between 
question topics and answer topics. 

Each extracted post includes other details like creation date, post type, ID, user defined tags, etc. For 
answer posts there is a pointer to the question it is answering and for question posts there is a pointer to all 
its answers. 
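
The extraction step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the paper: the attribute names (`Id`, `PostTypeId`, `ParentId`, `AcceptedAnswerId`, `CreationDate`, `Score`, `Body`, `Tags`) and the `<tag1><tag2>` tag encoding are assumptions based on the public Stack Exchange dump schema, and the function name `iter_posts` and the `SAMPLE` data are made up for the example.

```python
import io
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Date window taken from the text: 31 July 2008 to 27 March 2009.
START, END = datetime(2008, 7, 31), datetime(2009, 3, 27, 23, 59, 59)

def iter_posts(source, start=START, end=END):
    """Yield one dict per question (PostTypeId 1) or answer (PostTypeId 2) in the window."""
    for _, row in ET.iterparse(source, events=("end",)):
        if row.tag == "row" and row.get("PostTypeId") in ("1", "2"):
            created = datetime.fromisoformat(row.get("CreationDate"))
            if start <= created <= end:
                yield {
                    "id": row.get("Id"),
                    "is_question": row.get("PostTypeId") == "1",
                    "parent_id": row.get("ParentId"),  # present on answers only
                    "accepted_answer_id": row.get("AcceptedAnswerId"),
                    "created": created,
                    "score": int(row.get("Score", "0")),
                    "body": row.get("Body", ""),
                    "tags": re.findall(r"<([^>]+)>", row.get("Tags", "")),
                }
        row.clear()  # release the element; the real file is too large to keep in memory

# Tiny hypothetical sample standing in for the real posts file.
SAMPLE = """<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="2" CreationDate="2008-08-01T10:00:00.000"
     Score="5" Body="&lt;p&gt;How?&lt;/p&gt;" Tags="&lt;python&gt;&lt;dictionary&gt;" />
<row Id="2" PostTypeId="2" ParentId="1" CreationDate="2008-08-01T11:00:00.000"
     Score="3" Body="&lt;p&gt;Use copy.&lt;/p&gt;" />
<row Id="3" PostTypeId="1" CreationDate="2010-01-01T00:00:00.000" Score="0" Body="late" />
</posts>"""

posts = list(iter_posts(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Streaming with `iterparse` keeps memory use flat, which matters for a file holding several hundred thousand posts; each yielded dict corresponds to one document in the sense used above.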

3.2 Data Preprocessing 

Once the Stack Overflow data has been extracted, we preprocess the extracted posts in four steps. First, 
we discard all the code snippets present in the posts. Since all code snippets contain similar programming 
language syntax and keywords, they do not help topic models find useful topics. Second, we remove all 
the HTML tags, for example <b> and <a href="...">. Third, we remove common English-language stop 
words such as "a", "the", and "is". Finally, we apply the Porter stemming algorithm [5], which maps words to 
their base form; for example, "Implementation" and "Implementing" both get mapped to "Implement". 
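
The four preprocessing steps can be sketched as below. This is an illustrative sketch under assumptions, not the authors' code: the regular expressions, the short `STOP_WORDS` list, and the function name `preprocess` are made up for the example, and the stemmer is passed in as a parameter (for instance the `stem` method of NLTK's `PorterStemmer`) so that the block itself needs only the standard library.

```python
import html
import re

# Step 1: code snippets, assumed to be wrapped in <pre> or <code> elements.
CODE_BLOCK = re.compile(r"<pre[^>]*>.*?</pre>|<code[^>]*>.*?</code>",
                        re.DOTALL | re.IGNORECASE)
# Step 2: any remaining HTML tag such as <b> or <a href="...">.
HTML_TAG = re.compile(r"<[^>]+>")
# Tokens start with a letter; '+' and '#' are kept so that "c++" and "c#" survive.
TOKEN = re.compile(r"[a-z][a-z0-9+#]*")
# Step 3: a deliberately tiny stop word list; a real run would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"})

def preprocess(body, stem=lambda word: word):
    """Return the list of stemmed, stop-word-free tokens of one post body."""
    text = CODE_BLOCK.sub(" ", body)
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text).lower()
    tokens = TOKEN.findall(text)
    # Step 4: stemming; the default is the identity, so plug in a Porter stemmer here.
    return [stem(token) for token in tokens if token not in STOP_WORDS]

example = "<p>The <b>implementation</b> is slow.</p><pre><code>int x = 1;</code></pre>"
tokens = preprocess(example)
```

With the identity stemmer the example yields the tokens `implementation` and `slow`; the code snippet, the tags, and the stop words are gone.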

3.3 Topic Modeling 

Topic modeling is a suite of algorithms that help to discover and annotate large archives of documents 
with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the 
original texts to discover the themes that run through them, how those themes are connected to each other, 
and how they change over time. In this paper, we use one of the popular topic modeling techniques, 
Latent Dirichlet Allocation (LDA) [3]. 

LDA is a generative model for randomly generating observable data, given some hidden parameters [3]. 
It is a model for collections of discrete data such as text corpora. LDA provides, for each topic, a 
probability distribution over the words in the corpus and, for each document, a probability distribution 
over the discovered topics. In LDA, each document may be related to a mixture of various topics. LDA creates 
topics when it finds sets of words that tend to co-occur frequently in the documents of the corpus. The plate 
notation of LDA is shown in Fig. 2. 



Figure 2: Plate Notation of LDA 


In plate notation, the boxes are plates representing replicates. The outer plate represents the documents, 
whose number is denoted by M, while the inner plates represent the topics and the words, whose numbers are 
denoted by K and N respectively [3]. Each document may be related to one or more 
topics, and each word in the discovered topics has its own probability. The probability of each 
word w under the topics z and the probability of each topic in the documents are represented by $\phi$ and 
$\theta$ respectively. $\alpha$ and $\beta$ are the parameters of the Dirichlet priors on the per-document 
topic distribution and the per-topic word distribution respectively. Here, their value is set to 0.01; it can 
be set to any value between 0 and 1. 

3.3.1 LDA Implementation 

We use the Gibbs sampling algorithm to infer the topics from the Stack Overflow data. We make 
use of the Stanford Topic Modeling Toolbox (TMT) [TS] for the implementation of LDA. 
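
To make the inference step concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the TMT implementation and makes no claim to match its internals; the function name `lda_gibbs`, the iteration count, the seed, and the toy corpus are assumptions for illustration. The defaults for `alpha` and `beta` follow the value 0.01 given above.

```python
import random

def lda_gibbs(docs, K, alpha=0.01, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs; returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    nkw = [[0] * V for _ in range(K)]  # occurrences of word v assigned to topic k
    nk = [0] * K                       # total tokens assigned to topic k
    z = []                             # current topic of every token
    for d, doc in enumerate(docs):     # random initial assignment
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = word_id[w], z[d][n]
                # Remove the token, then resample its topic given all other tokens.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    # Smoothed estimates of the per-document topic and per-topic word distributions.
    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab

toy_docs = [["sql", "tabl", "queri"], ["thread", "memori", "pointer"],
            ["sql", "queri", "memori"]]
theta, phi, vocab = lda_gibbs(toy_docs, K=2, iters=50)
```

Each row of `theta` is the topic mixture of one document and each row of `phi` is the word distribution of one topic, so every row sums to 1.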

3.3.2 Number of Topics 

The number of topics is denoted by K. It is a user-specified parameter that controls the granularity of the 
discovered topics. A large value of K produces more detailed topics, and a small value 
produces more general topics. Here we aim for topics of medium granularity, so we set K to 10. 


4 Metrics and Analysis 

LDA discovers K topics, $z_1, \ldots, z_K$. The membership of a particular topic $z_k$ in document $d_i$ is 
denoted as $\theta(d_i, z_k)$. Note that $\forall i, k : 0 \le \theta(d_i, z_k) \le 1$ and 
$\forall i : \sum_{k=1}^{K} \theta(d_i, z_k) = 1$. We then define a threshold, $\delta$, to 
indicate whether a particular topic is "in" a document. A document $d_i$ can have between 1 and 5 dominant 
topics, each with a membership above $\delta$. Thus, by using the $\delta$ threshold as a membership cutoff, 
we keep only the main topics in each document and discard the probabilistic errors. 
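
The cutoff can be illustrated with a few lines of Python. The function name `dominant_topics` and the value 0.10 used for the threshold in the example are assumptions; the text does not state the threshold value that was used.

```python
def dominant_topics(theta_row, delta):
    """Keep only the topics whose membership in one document exceeds delta."""
    return {k: p for k, p in enumerate(theta_row) if p > delta}

# Hypothetical topic memberships of one document over four topics.
memberships = [0.55, 0.05, 0.30, 0.10]
kept = dominant_topics(memberships, delta=0.10)
```

Here topics 0 and 2 are kept, while topics 1 and 3 fall at or below the cutoff and are dropped.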

4.1 Topic Share 

We define the overall share [H] of a topic $z_k$ across all posts as, 

$$\mathrm{share}(z_k) = \frac{1}{|D|} \sum_{\substack{d_i \in D \\ \theta(d_i, z_k) > \delta}} \theta(d_i, z_k) \tag{1}$$

where D is the set of all posts in our dataset. The share metric measures the proportion of posts that contain 
the topic $z_k$. For example, if a topic has a share metric of 10%, then 10% of all posts contain this topic. The 
share metric allows us to identify the major discussion topics in our Stack Overflow dataset. 
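
As a small worked example of the share metric, the sketch below sums the above-threshold memberships of one topic and divides by the number of posts. The function name, the toy memberships, and the threshold value 0.10 are assumptions for illustration.

```python
def share(theta, k, delta):
    """Share of topic k: above-threshold memberships summed over posts, divided by |D|."""
    return sum(row[k] for row in theta if row[k] > delta) / len(theta)

# Hypothetical memberships of three posts over two topics.
theta_toy = [[0.80, 0.20], [0.05, 0.95], [0.50, 0.50]]
share_topic0 = share(theta_toy, k=0, delta=0.10)  # (0.80 + 0.50) / 3
share_topic1 = share(theta_toy, k=1, delta=0.10)  # (0.20 + 0.95 + 0.50) / 3
```

The second post contributes nothing to topic 0 because its membership of 0.05 is below the cutoff.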

4.2 Topic Relationships 

We have to determine the relationship between topics in questions and topics in the corresponding answers. We 
consider a discussion to be a single question post along with its answer posts. We define the relationship rel 
between two topics $z_q$ and $z_a$ in one discussion as, 

$$\mathrm{rel}(z_q, z_a) = \sum_{\substack{d_i \in Q,\ d_j \in A(d_i) \\ \theta(d_i, z_q) > \delta \\ \theta(d_j, z_a) > \delta}} \theta(d_i, z_q) \times \theta(d_j, z_a) \tag{2}$$

where Q is the set of all question posts and $A(d_i)$ is the set of all answer posts related to question $d_i$. 
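
The relationship metric can be sketched as a double loop over questions and their answers. The data layout (a dict of memberships keyed by post id and a dict mapping each question id to its answer ids), the function name, and the threshold value are assumptions for illustration.

```python
def rel(theta, answers_of, zq, za, delta):
    """Sum of question-membership times answer-membership over all above-threshold pairs."""
    total = 0.0
    for question_id, answer_ids in answers_of.items():
        if theta[question_id][zq] <= delta:
            continue  # topic zq is not "in" this question
        for answer_id in answer_ids:
            if theta[answer_id][za] > delta:
                total += theta[question_id][zq] * theta[answer_id][za]
    return total

# One hypothetical discussion: question q1 with answers a1 and a2, two topics.
theta_toy = {"q1": [0.70, 0.30], "a1": [0.20, 0.80], "a2": [0.95, 0.05]}
answers_toy = {"q1": ["a1", "a2"]}
rel_0_to_1 = rel(theta_toy, answers_toy, zq=0, za=1, delta=0.10)  # 0.70 * 0.80
rel_0_to_0 = rel(theta_toy, answers_toy, zq=0, za=0, delta=0.10)  # 0.70 * 0.20 + 0.70 * 0.95
```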

4.3 Topic Trends Over Time 

To analyze the trends of topics, we define the impact [16] of a topic $z_k$ in month m as, 

$$\mathrm{impact}(z_k, m) = \frac{1}{|D(m)|} \sum_{d_i \in D(m)} \theta(d_i, z_k) \tag{3}$$

where D(m) is the set of all posts in the month m. The impact metric measures the relative proportion of 
posts related to that topic compared to the other topics in that particular month. 
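
The impact metric is a per-month average of topic memberships, which the following sketch computes for every month at once. The month labels, the data layout, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def impact_by_month(theta, month_of, k):
    """Average membership of topic k over the posts of each month."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for post_id, row in theta.items():
        month = month_of[post_id]
        totals[month] += row[k]
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in sorted(counts)}

# Three hypothetical posts over two topics, spread over two months.
theta_toy = {"p1": [0.60, 0.40], "p2": [0.20, 0.80], "p3": [0.90, 0.10]}
month_toy = {"p1": "2008-08", "p2": "2008-08", "p3": "2008-09"}
trend_topic0 = impact_by_month(theta_toy, month_toy, k=0)
```

Plotting such a dictionary month by month gives the kind of trend line that the impact figures show.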

5 Suggesting Tags 

Stack Overflow identifies each question post by the tags that the developer supplies when posting 
the question. Users who lack prior knowledge of the existing tags may create unnecessary ones. 
To overcome this problem, when a user enters a question, it is analyzed to find its latent 
topics, and tags are suggested based on those topics. 
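
The text does not spell out how tags are derived from topics, so the sketch below shows only one plausible scheme and should be read as an assumption: each existing tag is counted under every dominant topic of the question that carries it, and a new question receives the most frequent tags of its own dominant topics. All names, the threshold value, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def tags_per_topic(questions, delta):
    """questions: iterable of (theta_row, tags). Count tags under each dominant topic."""
    counts = defaultdict(Counter)
    for theta_row, tags in questions:
        for k, p in enumerate(theta_row):
            if p > delta:
                counts[k].update(tags)
    return counts

def suggest_tags(theta_row, counts, delta, per_topic=2):
    """Suggest existing tags for a new question from its dominant topics, strongest first."""
    suggested = []
    for k in sorted(range(len(theta_row)), key=lambda j: -theta_row[j]):
        if theta_row[k] <= delta:
            continue
        for tag, _ in counts[k].most_common(per_topic):
            if tag not in suggested:
                suggested.append(tag)
    return suggested

# Hypothetical tagged questions with memberships over two topics.
existing = [([0.9, 0.1], ["sql", "mysql"]), ([0.8, 0.2], ["sql"]),
            ([0.1, 0.9], ["python"]), ([0.3, 0.7], ["python", "django"])]
counts = tags_per_topic(existing, delta=0.25)
new_question_tags = suggest_tags([0.2, 0.8], counts, delta=0.25)
```

Because only tags that already exist in the corpus can be returned, a scheme of this kind never creates a new tag, which is the property the section is after.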

6 Quality Analysis of Questions in Stack Overflow 

The number of questions added to Stack Overflow each month has grown steadily since the start 
of the site and has reached a maximum of more than 200,000 new questions per month. The 
quality of the text content on the Stack Overflow website varies, ranging from good quality 
questions and answers to low quality ones. 

In this work we focus on different levels of quality of question posts in order to maintain the 
growth of the Stack Overflow website. Since a large number of questions are posted each month, 
some questions are answered and some remain unanswered. By identifying the quality of the 
questions in each topic discovered by LDA, using the score value of each question post, we 
can discard the low quality questions from Stack Overflow. We also discard the question posts whose score 
value is 0, on the assumption that 0-scored questions have not attracted enough interest from the community of 
developers. After removing 0-scored questions, our dataset contains 94,879 questions, which we subdivide into 
three categories: 

1. Good Quality: Questions with an accepted answer and a score value greater than 7 fall into this category. 

2. Medium Quality: Questions with an accepted answer and a score value between 1 and 6 fall into this category. 

3. Low Quality: Questions with no accepted answer and a score value less than 0 fall into this category. 
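
The three rules translate directly into a small classifier. The sketch below encodes the thresholds exactly as listed; the function name and the choice to return `None` for questions that match none of the listed rules (for example a score of exactly 7, or a 0-scored question, which is discarded beforehand) are assumptions.

```python
def question_quality(score, has_accepted_answer):
    """Classify one question by the listed rules; None means no rule applies."""
    if score == 0:
        return None  # 0-scored questions are discarded before classification
    if has_accepted_answer and score > 7:
        return "good"
    if has_accepted_answer and 1 <= score <= 6:
        return "medium"
    if not has_accepted_answer and score < 0:
        return "low"
    return None

labels = [question_quality(12, True), question_quality(3, True), question_quality(-2, False)]
```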

7 Results and Discussion 

The 10 topics discovered by LDA and some of their top LDA words are shown in Table 1. 


Table 1: Topics and Their Top Words 

Topic Names   Top LDA words 
-----------   ------------------------------------- 
Topic 0       memori thread alloc process pointer 
Topic 1       except error log messag fail 
Topic 2       think peopl question develop dai 
Topic 3       web net java http framework 
Topic 4       tabl sql queri insert updat 
Topic 5       python perl rubi php script 
Topic 6       string arrai declar argument list 
Topic 7       svn repositori control branch chang 
Topic 8       page button click css browser 
Topic 9       session secur password login site 


The TMT tool discovers 10 topics from the Stack Overflow post data. The topics can be arranged 
in ascending or descending order of their share metric. Figure 3 shows the share graph for each 
topic. 

From Fig. 3 we observe that Topic6 is the most discussed of the 10 discovered topics in 
Stack Overflow. The higher the share value, the more the topic is discussed on Stack Overflow. 

Figures 4 and 5 plot the trend line of each topic and compare topics based on their impact values. 
These trend lines indicate the rise or fall of interest in a particular topic on Stack Overflow. Table 2 shows 
the overall result: the topics discovered by LDA, and the share value and trend of each topic. 

Table 3 tabulates the results of tag suggestion. 

Figure 6 depicts the variation in the number of quality questions for a particular score value, and Fig. 7 shows the 
variation in the number of quality questions for a particular topic, Topic1. 

Every user faces problems during programming and tends to seek help from programming 
Q&A websites like Stack Overflow. The posts belong to various topics, and the trends of these topics change over 
time; both need to be analyzed before deciding whether the posts are useful. The topic modeling technique 
LDA can handle the numerous latent topics found in a large dataset. The probability distribution 
obtained after applying LDA is reliable and gives the expected results, and it can be followed by analyzing the 
quality of questions in Stack Overflow and suggesting tags for new questions. The proposed method helps 
to eliminate low quality questions and at the same time suggests tags for a given question instead of simply 
adding more tags to Stack Overflow. This also helps to maintain the growth of the Stack Overflow website. 


Figure 3: Topics Share Graph 



Figure 4: Impact Of Topic1 


Table 2: Topics Shares and Trends 

Topic Names   Share (%)   Trend 
-----------   ---------   ----- 
Topic6        13.78 
Topic8        11.56 
Topic9         9.79 
Topic5         9.78 
Topic4         9.66 
Topic3         9.58 
Topic7         9.56 
Topic2         9.02 
Topic1         8.14 
Topic0         7.96 


Figure 5: Impact Of Two Topics 


Figure 6: Quality of Questions for particular Score Values 


Figure 7: Quality of Questions in Topic1 


Table 3: Tags Suggestion Result 

Question 1: I have a Python program that works with dictionaries a lot. I have to make copies of 
dictionaries thousands of times. I need a copy of both the keys and the associated contents. The copy 
will be edited and must not be linked to the original (e.g. changes in the copy must not affect the 
original.) 
Topics: Topic8, Topic7, Topic1, Topic5 
Tags: jquery, c#, python, javascript, asp.net 

Question 2: I would like to know how to export in Arcgis a list of values calculated in python script 
into one of the following data formats: csv, txt, xls, dbase or other. I would also like to know how to 
create such file in case that it doesnt exist. 
Topics: Topic6, Topic8, Topic4, Topic1 
Tags: mysql, sql-server, java, database 

Question 3: I have multiple tables in a database.I need to have a SQL statement in which i can get this 
values. How will i get this code in SQL using those join methods, im writing it in VB6 ADODC, is this 
the same syntax in a standard SQL. 
Topics: Topic4 
Tags: database, sql, mysql, c#, php 


8 Conclusion 


In this paper, we proposed a method to identify the topics present in the Stack Overflow posts dataset. We 
also examined the trends of those topics, that is, how they change over time, in order to understand developers' 
discussions on Stack Overflow. Our method uses the topic modeling technique LDA to discover topics from 
the textual content of the Stack Overflow posts dataset. This is followed by a technique that suggests tags for 
new question posts, which limits the generation of new and unnecessary tags. In this way we can 
avoid overwhelming Stack Overflow with tags. 

Since the number of new questions increases each month, we analyzed the quality of questions to 
avoid questions that go unanswered. We consider three levels of quality: 
Good Quality, Medium Quality and Low Quality. Good Quality questions have an 
accepted answer and a score value greater than 7. Medium Quality questions have an accepted answer 
and a score value between 1 and 6. Low Quality questions have a score value less than 0. As we 
increase or decrease the score value, we observe the variation in the quality of questions in Stack Overflow. 

Our proposed method helps to discard low quality questions and unwanted tags and also recommends 
tags for new questions. This further helps in maintaining the growth of the Stack Overflow website by removing 
unnecessary data from the website. 